
Journal of Biomedical Informatics

Elsevier BV

Preprints posted in the last 90 days, ranked by how well they match Journal of Biomedical Informatics's content profile, based on 45 papers previously published here. The average preprint has a 0.07% match score for this journal, so anything above that is already an above-average fit.

1
MedSDoH: A Rule-Based System for Extracting Social Determinants of Health from Multi-site EHRs Based on the OHNLP Framework

Ahn, J.; Fu, S.; Palacios, D. M.; Jeong, H.-H.; Wang, L.; Swartz, M. C.; Tosur, M.; Redondo, M. J.; Wu, X.; Yue, Z.; Kakadiaris, A.; Wang, N.; Li, Z.; Huang, M.; Wen, A.; Harris, D.; Wang, Y.; Kwak, M. J.; Liu, Z.; Liu, H.

2026-04-29 health informatics 10.64898/2026.04.27.26351699 medRxiv
Top 0.1%
36.7%

Objective: Social Determinants of Health (SDoH) are critical to patient care and population health. Despite their importance, SDoH information is frequently embedded within unstructured clinical text such as patient-reported information or social worker notes, which limits its use in clinical decision-making and resource allocation. Although transformer-based models represent the current state of the art, their scalability, computational requirements, and limited transparency pose barriers to large-scale multi-site clinical implementation. In this context, rule-based NLP systems remain valuable, particularly when explainability, reproducibility, and rapid customization are essential. Methods: MedSDoH was developed within the Open Health Natural Language Processing (OHNLP) Framework using literature-derived SDoH resources, standardized domain definitions, and expert-curated rulesets. Large language models (LLMs) were used during development to assist with rule generation and lexicon expansion. Rules were iteratively refined against a gold-standard annotated corpus from two health systems and then evaluated on independent datasets. Results: The final system included 942 regular expression rules spanning 22 SDoH domains. In validation on two external datasets, MedSDoH demonstrated generalizability and comparable performance across sites. The system has been made publicly available so that the research community can collaboratively contribute to its maintenance and extension through disease- or site-specific adaptations. Conclusion: MedSDoH is a computationally efficient and open-source system for large-scale SDoH extraction from clinical text. It is well-suited for multi-site adaptation and deployment in resource-constrained settings.
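For readers unfamiliar with rule-based extraction, the following is a minimal sketch of how regular expression rules can be mapped to SDoH domains. The domain names, patterns, and the `extract_sdoh` helper are hypothetical illustrations and are not taken from MedSDoH or the OHNLP Framework.

```python
import re

# Hypothetical rules for illustration only: (domain, compiled pattern).
# These are NOT the MedSDoH rules, which number 942 across 22 domains.
RULES = [
    ("housing_instability", re.compile(r"\b(homeless|evicted|no stable housing)\b", re.I)),
    ("food_insecurity", re.compile(r"\b(food insecurity|skips? meals)\b", re.I)),
    ("social_isolation", re.compile(r"\b(lives alone|socially isolated)\b", re.I)),
]


def extract_sdoh(note):
    """Return (domain, matched_text, start, end) tuples sorted by position."""
    hits = []
    for domain, pattern in RULES:
        for match in pattern.finditer(note):
            hits.append((domain, match.group(0), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])


example_note = "Patient lives alone and was recently evicted; sometimes skips meals."
for hit in extract_sdoh(example_note):
    print(hit)
```

Each hit keeps the character offsets of the matched text, so every extracted mention can be traced back to the rule and span that produced it. This sketch does not handle negation or context (for example, "denies being homeless"), which a production system would need to address.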

2
Can NLP Detect Loneliness in Electronic Health Records? A Proof-of-Concept Study

Park, T.; Habibi, S.; Lowers, J.; Sarker, A.; Bozkurt, S.

2026-04-11 health informatics 10.64898/2026.04.08.26350462 medRxiv
Top 0.1%
33.1%

Loneliness is clinically important but under-documented in electronic health records (EHRs), posing challenges for secondary use and computational phenotyping. This study evaluated whether natural language processing (NLP) methods can detect and classify loneliness severity from clinical notes. Patients with a loneliness survey (mild, moderate, severe) were identified, and notes within six months prior to the survey were retrieved. An expert-expanded lexicon was applied, and transformer models (RoBERTa, ClinicalBERT, Longformer) were fine-tuned for loneliness severity classification. Large language model-based summarization of social and psychiatric history was also tested as an alternative input representation. Performance was evaluated using accuracy, weighted-F1, and per-class F1. All models achieved modest accuracy (0.3 to 0.7), and struggled to identify severe loneliness, reflecting sparse and inconsistent documentation even among surveyed patients. While summarization marginally improved accuracy, gains primarily reflected mild predictions. Manual review of 100 social worker notes from severely lonely patients found explicit mentions of loneliness in only two cases, confirming that relevant documentation is exceedingly rare. These findings demonstrate that model performance is constrained by the sparse and inconsistent documentation of loneliness in EHRs, rather than by deficiencies in the modeling approach itself.

3
A Heterogeneous Graph Neural Network Framework for Multi-Horizon Stroke Mortality Prediction

Tharzeen, A.; Vafaei Sadr, A.; Radfar, N.; Hwang, W.; Abedi, V.; Zand, R.

2026-06-10 health informatics 10.64898/2026.06.09.26355176 medRxiv
Top 0.1%
28.7%

Background: Machine learning models for stroke mortality prediction typically treat each time horizon independently and use flat tabular features that ignore the relational structure of electronic health records (EHRs). In this pilot study, we leveraged graph-based machine learning models to predict post-stroke all-cause mortality across three different time horizons. Methods: We developed Stroke Temporal Heterogeneous Graph (StrokeTHG), a heterogeneous graph neural network model for simultaneous multi-horizon stroke mortality prediction (30-day, 90-day, 1-year) using EHR data from Penn State Health System. The model encodes various relations among EHR entities (e.g., patient, diagnosis, comorbidity) and temporal encoding of admission time to better predict stroke mortality. We compared our proposed approach against various baseline methods, including Logistic Regression, Random Forest, and XGBoost. We also performed ablation and subgroup analyses, evaluated the quality of learned graph embeddings, and assessed the importance of different edge types in the graph. Results: We included 4,144 stroke patients (mean age 69.2 years; 54.3% men), of whom 3,332 (80.4%) survived their stroke after one year. 30-day, 90-day, and 1-year mortality rates were 9.7%, 13.7%, and 19.6%, respectively. Our proposed approach, StrokeTHG, achieved AUROC of 0.872, 0.878, and 0.837 across horizons, outperforming all tabular baselines. At ≥75% specificity, the model identified 5-10 percentage points more mortality cases than the best baseline at each horizon. Subgroup analysis demonstrated consistent performance across sex subgroups and the largest discriminative gains in the Age 65-80 stratum. Edge-type ablation identified phenotype-patient and admission-patient edges in the constructed EHR graph as the most influential relational edges for mortality prediction. 
StrokeTHG embeddings outperformed all graph and matrix factorization baselines under an identical downstream classifier, confirming that performance gains stem from representation quality rather than classifier capacity. Conclusions: StrokeTHG demonstrates that heterogeneous graph representations of EHR data provide a consistent improvement over flat tabular models for multi-horizon stroke mortality prediction, with particular advantage at clinically actionable sensitivity thresholds and novel multi-horizon monotonic prediction capability. This methodological framework may be adaptable to other EHR-based clinical research studies seeking to leverage heterogeneous relational structures for predictive modeling.

4
Leveraging State-of-the-Art LLMs for the De-identification of Sensitive Health Information in Clinical Speech

Dai, H.-J.; Mir, T. H.; Fang, L.-C.; Chen, C.-T.; Feng, H.-H.; Lai, J.-R.; Hsu, H.-C.; Nandy, P.; Panchal, O.; Liao, W.-H.; Tien, Y.-Z.; Chen, P.-Z.; Lin, Y.-R.; Jonnagaddala, J.

2026-04-17 health informatics 10.64898/2026.04.13.26349911 medRxiv
Top 0.1%
28.5%

Accurate recognition and deidentification of sensitive health information (SHI) in spoken dialogues requires multimodal algorithms that can understand medical language and contextual nuance. However, the recognition and deidentification process itself risks exposing SHI. Additionally, the variability and complexity of medical terminology, along with the inherent biases in medical datasets, further complicate this task. This study introduces the SREDH/AI-Cup 2025 Medical Speech Sensitive Information Recognition Challenge, which focuses on two tasks: Task 1, speech transcription, in which systems must accurately transcribe speech into text; and Task 2, medical speech de-identification, in which systems must detect and appropriately classify mentions of SHI. The competition attracted 246 teams; top-performing systems achieved a mixed error rate (MER) of 0.1147 and a macro F1-score of 0.7103, with average MER and macro F1-score of 0.3539 and 0.2696, respectively. Results were presented at the IW-DMRN workshop in 2025. Notably, the results reveal that LLMs were prevalent across both tasks: 97.5% of teams adopted LLMs for Task 1 and 100% for Task 2, highlighting their growing role in healthcare. Furthermore, we fine-tuned six models, demonstrating strong precision (~0.885-0.889) with slightly lower recall (~0.830-0.847), resulting in F1-scores of 0.857-0.867.

5
Medicalbench: Evaluating Large Language Models Towards Improved Medical Concept Extraction

Yang, Z.; Lyng, G. D.; Batra, S. S.; Tillman, R. E.

2026-04-16 health informatics 10.64898/2026.04.12.26350704 medRxiv
Top 0.1%
22.9%

Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts, such as diagnoses, are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts and provide limited coverage of cases in which medically relevant concepts must be inferred. We present MedicalBench, a new benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. Annotators provide sentence-level evidence spans and concise medical rationales. The final dataset contains 823 high-quality examples. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs and a supervised baseline reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. 
We further show that explicitly incorporating reasoning cues and prompting to extract implicit evidence substantially improves medical concept extraction, while performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.

6
CausalKnowledgeTrace: A Novel Computational Framework for Automated Literature-Based Causal Graph Construction and Evidence-Based Variable Selection in Biomedical Research

Upadhayaya, R.; Pradhan, M. M.; Metzger, V. T.; Malec, S. A.

2026-05-12 bioinformatics 10.64898/2026.05.07.723601 medRxiv
Top 0.1%
22.0%

Background: Variable selection for causal inference from observational biomedical data is challenging, as overlooking confounders or conditioning on colliders leads to biased estimates. While vast causal knowledge exists in biomedical literature, manually extracting this information for principled variable selection is impractical at scale. Methods: We developed CausalKnowledgeTrace, a Python-based computational framework with a Django web interface that systematically leverages structured causal knowledge from the Semantic MEDLINE Database (SemMedDB) to inform variable selection in causal studies. The system implements a six-stage analysis pipeline using NetworkX for graph operations, including graph parsing, basic analysis, comprehensive cycle detection, systematic generic node removal, post-removal analysis, and formal causal inference with bias detection. Results: Analysis of the hypertension-Alzheimer's relationship across three degree neighborhoods (1-3) demonstrated systematic scaling of causal complexity: 361-866 variables, 429-1,442 relationships, with graph densities of 0.0033-0.0019. The analysis revealed complex cyclic structures with 54-606 baseline cycles across degree levels. Processing times ranged from 0.3-1.0 seconds for all three degrees, demonstrating computational efficiency for complex biomedical networks. Key confounders identified across all degrees included inflammation, diabetes, insulin resistance, obesity, and ischemia. In the third-degree graph, the pipeline structurally identified 39 confounders, 11 mediators, and 3 colliders from the causal graph. Among the key identified confounders and mediators (including obesity, oxidative stress, ischemia, and vascular diseases), all were found to have strong supporting evidence in established epidemiological and pathophysiological literature. 
Conclusions: CausalKnowledgeTrace provides a scalable, evidence-based approach to causal graph construction that systematically identifies confounders and bias structures often missed by conventional approaches. The Python-Django architecture enables both standalone analysis and integration into larger computational workflows, representing a significant advance in computational support for causal inference in biomedical research. Statement of Significance. Problem or Issue: Selecting proper confounders and variables for causal inference from observational biomedical datasets is challenging and often biased by limited expertise or manual review. What is Already Known: Existing approaches rely on domain experts, statistical variable screening, or manual construction of causal graphs, but these often overlook literature-documented confounders and complex biases. What this Paper Adds: This paper introduces an automated, literature-based framework for synthesizing and validating causal graphs, identifying critical variables and complex bias structures, such as M-bias and butterfly bias, with full evidentiary traceability. Who would benefit from the new knowledge in this paper? Epidemiologists, biomedical researchers, informaticians, and clinical investigators seeking reliable and transparent causal modeling for observational studies.

7
The Golden Opportunity or the Cutting Room Floor? Quantifying and Characterizing the Loss and Addition of Social Determinants of Health during Clinician Editing of Ambient AI Documentation

Kim, S.; Guo, Y.; Sutari, S.; Chow, E.; Tam, S.; Perret, D.; Pandita, D.; Zheng, K.

2026-04-22 health systems and quality improvement 10.64898/2026.04.20.26351322 medRxiv
Top 0.1%
19.0%

Social determinants of health (SDoH) are important for clinical care, but it remains unclear how much AI-captured social context is preserved after clinician editing in ambient documentation workflows. We retrospectively analyzed 75,133 paired ambient AI-drafted and clinician-finalized note sections from ambulatory care at a large academic health system. Using a rule-based NLP pipeline, we extracted 21 SDoH categories and quantified retention, deletion, and addition. SDoH appeared in 25.2% of AI drafts versus 17.2% of final notes. At the mention level, AI captured 29,991 SDoH mentions, of which 45.1% were deleted and 54.9% were retained; clinicians added 3,583 new mentions. Insurance and marital status were most often deleted, whereas substance use and physical activity were more often retained. Deletion patterns also varied by specialty, supporting the need for specialty-aware ambient AI systems.
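The retention, deletion, and addition counts described above can be illustrated with simple set operations on paired mentions. This is a sketch under stated assumptions: mentions are represented as hypothetical `(category, text)` tuples and matched by exact equality, which may differ from the matching procedure the authors used.

```python
def compare_mentions(draft_mentions, final_mentions):
    """Compare SDoH mentions in an AI draft against the clinician-finalized note.

    Assumption: two mentions are "the same" if their (category, text)
    tuples are exactly equal. Rates are relative to the draft mentions.
    """
    draft = set(draft_mentions)
    final = set(final_mentions)
    retained = draft & final   # present in both versions
    deleted = draft - final    # removed during clinician editing
    added = final - draft      # newly written by the clinician
    n_draft = len(draft)
    return {
        "retained": len(retained),
        "deleted": len(deleted),
        "added": len(added),
        "retention_rate": len(retained) / n_draft if n_draft else 0.0,
        "deletion_rate": len(deleted) / n_draft if n_draft else 0.0,
    }


# Hypothetical paired note section.
draft = [("insurance", "medicaid"), ("marital_status", "married"), ("substance_use", "smokes")]
final = [("substance_use", "smokes"), ("physical_activity", "walks daily")]
print(compare_mentions(draft, final))
```

Because retention and deletion are both measured against the draft, the two rates sum to one for any non-empty draft, which matches the 45.1% and 54.9% split reported in the abstract; additions are counted separately.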

8
Generation and Evaluation of Realistic Synthetic Clinical Progress Notes for Prostate Cancer using Large Language Models.

Rey-Blanes, A.; Veredas-Morente, J.; Vivas-Vargas, E.; Gil-Garcia, F.; Moreno-Barea, F. J.; Veredas, F. J.

2026-05-28 health informatics 10.64898/2026.05.25.26354027 medRxiv
Top 0.1%
18.8%

Background and Objective: Access to real-world electronic health records (EHRs) remains limited by privacy, governance and annotation constraints, hindering the development of clinical natural language processing models. Realistic synthetic progress notes may provide EHR-like corpora that preserve clinically rigorous information on diagnoses, treatments, symptoms, imaging, laboratory findings and therapeutic trajectories without relying directly on sensitive patient records. This study evaluates whether large language models (LLMs) can generate realistic Spanish prostate cancer progress notes from published case reports, preserving clinical content, temporality and hospital-style conventions.

9
Early Detection of Rare Disease Using Hierarchical Set-to-Sequence Modeling of Structured Electronic Health Records

Ma, Y.; Chinthala, L.; Mohammed, A.; Davis, R. L.; Colonna, V.

2026-05-06 health informatics 10.64898/2026.05.04.26352393 medRxiv
Top 0.1%
18.6%

Rare diseases are characterized by heterogeneous, weak, and sparse phenotypic signals that emerge gradually across longitudinal clinical visits, making early detection a persistent challenge. In this study, we propose a hierarchical set-to-sequence (HSS) framework for prospective rare disease detection using structured EHR data. HSS decomposes the problem into two levels: (1) intra-visit encoding via Multi-Query Attention (MQA), which treats heterogeneous clinical events within a single clinical visit as an unordered set to generate unified visit-level representations, and (2) inter-visit temporal modeling with transformer encoders conditioned on patient visit age and inter-visit time gaps to capture disease progression and the irregular intervals between clinical visits. We construct a real-world cohort of 40,223 patients comprising 708,422 visits from a single academic medical center (2005-2025), with 3,032 rare disease cases identified by curated rule-based phenotyping including severe neuro-developmental, congenital, or genetic conditions. We formulate the task as multi-horizon prospective binary classification with five prediction horizons of 7, 30, 90, 180, and 365 days prior to first diagnosis. Experimental results show that the proposed HSS model consistently outperforms linear logistic regression, tree-based XGBoost, and Transformer-based baselines at every prediction horizon, ranging from AUROC = 0.893 and AUPRC = 0.601 at 7 days with 5.17% prevalence to AUROC = 0.829 and AUPRC = 0.228 at 365 days with 3.98% prevalence. Notably, the performance gap between HSS and the strongest competing baseline is largest at the 365-day horizon, indicating stronger advantages for long-horizon prediction where phenotypic signals for rare diseases are weak and sparse. Additional analyses further clarify the contribution of the hierarchical components and confirm the importance of hierarchical modeling. 
This work contributes to the ongoing development of AI methodologies tailored to rare diseases by introducing a hierarchical framework for early detection using structured longitudinal clinical data.

10
MIMIC-IV-Phenotype-Atlas (MIPA) : A Publicly Available Dataset for EHR Phenotyping

Yamga, E.; Goudrar, R.; Despres, P.

2026-04-24 health informatics 10.64898/2026.04.16.26350888 medRxiv
Top 0.1%
18.4%

Secondary use of electronic health records (EHRs) often requires transforming raw clinical information into research-grade data. A central step in this process is EHR phenotyping - the identification of patient cohorts defined by specific medical conditions. Although numerous approaches exist, from ICD-based heuristics to supervised learning and large language models (LLMs), the field lacks standardized benchmark datasets, limiting reproducibility and hindering fair comparison across methods. Method: We developed the MIMIC-IV Phenotype Atlas (MIPA) dataset, an adaptation of MIMIC-IV that provides expert-annotated discharge summaries across 16 phenotypes of varying prevalence and complexity. Two independent clinicians reviewed and labeled the discharge summaries, resolving disagreements by consensus. In parallel, we implemented a processing pipeline that extracts multimodal EHR features and generates training, validation, and testing datasets for supervised phenotyping. To illustrate MIPA's utility, we benchmarked four phenotyping methods on the task: ICD-based classifiers, keyword-driven Term Frequency-Inverse Document Frequency (TF-IDF) classifiers, supervised machine learning (ML) models, and LLMs. Results: The final MIPA corpus consists of 1,388 expert-annotated discharge summaries. Annotation reliability was high (mean document-level kappa = 0.805, mean label-level kappa = 0.771), with 91% of disagreements resolved through consensus review. MIPA provides high-quality phenotype labels paired with structured EHR features and predefined train/validation/test splits for each phenotype. In the benchmarking case study, LLMs achieved the highest F1 scores in 13 of 16 phenotypes, particularly for conditions requiring contextual interpretation of clinical narrative, while supervised ML offered moderate improvements over rule-based baselines. 
Conclusion: MIPA is the first publicly available benchmark dataset dedicated to EHR phenotyping, combining expert-curated annotations, broad phenotype coverage, and a reproducible processing pipeline. By enabling standardized comparison across ICD-based heuristics, ML models, and LLMs, MIPA provides a durable reference resource to advance methodological development in automated phenotyping.

11
Social Determinants of Health and Chronic Disease Risk Prediction in the All of Us Research Program

Kammer-Kerwick, M.; Dave, Y.; Parekh, V.; McDonald, L.; Watkins, S. C.

2026-03-23 health informatics 10.64898/2026.03.19.26348851 medRxiv
Top 0.1%
18.3%

Social determinants of health (SDoH), the social, economic, and environmental conditions shaping health trajectories, contribute to chronic disease risk comparably to clinical factors, yet most predictive studies model conditions independently, obscuring shared social pathways. Using participant-reported data from the All of Us Research Program (n=259,186), we evaluated the relative contributions of demographic factors and twelve SDoH domains to chronic disease prediction while accounting for the co-occurrence structure of conditions. Hierarchical clustering identified two clinically meaningful outcome clusters: a Mental Health cluster (depression, anxiety, substance use disorder; prevalence = 51.7%) and a Cardiometabolic cluster (heart disease, diabetes, chronic lung disease; prevalence = 78.7%). Gradient boosted models were trained for each cluster under three feature configurations, SDoH only, demographics only, and combined, with performance evaluated using bootstrapped area under the receiver operating characteristic curve (AUC). Combined models achieved the highest discriminative performance for Mental Health (AUC = 0.701, 95% confidence interval: 0.696 - 0.705) and Cardiometabolic (AUC = 0.662, 95% CI: 0.655 - 0.668) outcomes. SDoH features outperformed demographics for Mental Health prediction (AUC = 0.678 vs. 0.655), while performance was comparable for Cardiometabolic outcomes (SDoH = 0.633; demographics = 0.636). Interpretability analysis using SHapley Additive exPlanations (SHAP) identified stress, discrimination, and religion/spirituality as the most influential SDoH domains for Mental Health outcomes; age, neighborhood disorder, and discrimination were primary predictors for Cardiometabolic outcomes. Double machine learning confirmed significant causal effects, with stress showing the largest average treatment effect on Mental Health outcomes (ATE = 0.093, p < 0.001). 
Interaction analyses revealed 24 significant SDoH-by-demographic interactions, indicating differential SDoH effects across racial/ethnic and gender/sexual minority subgroups. These findings indicate that experiential social factors carry stronger predictive signal for mental health conditions, while Cardiometabolic conditions are more strongly shaped by demographic and structural neighborhood characteristics. Results support condition-specific SDoH screening protocols over universal instruments and targeted social interventions to reduce health disparities. Author Summary: We developed and tested a four-stage analytical framework to predict chronic disease risk more precisely by combining individual Social Determinants of Health (one's social environments, stress levels, neighborhood conditions, and community connections) with conventional patient demographics such as age, income, and race/ethnicity. Using data from nearly 260,000 participants in the All of Us Research Program, we found that including social and environmental factors meaningfully improves prediction of both mental health conditions (depression, anxiety, and substance use) and cardiometabolic conditions (heart disease, diabetes, and lung disease). Importantly, not all social factors matter equally for all conditions. Mental health outcomes were most strongly shaped by experiential factors (stress, discrimination, and loneliness), while cardiometabolic outcomes were more strongly driven by age and neighborhood characteristics such as disorder and limited access to physical activity. We also found that stress, discrimination, and neighborhood disadvantage have stronger health effects among Black, Hispanic, and gender/sexual minority individuals, pointing to where targeted interventions could reduce persistent health disparities. 
These findings suggest that clinicians and health systems should move away from one-size-fits-all social needs screening toward condition-specific tools that prioritize the social factors most relevant to the conditions being managed.

12
Fine-Tuning PubMedBERT for Hierarchical Condition Category Classification

Wang, X.; Hammarlund, N.; Prosperi, M.; Zhu, Y.; Revere, L.

2026-04-15 health systems and quality improvement 10.64898/2026.04.13.26350814 medRxiv
Top 0.1%
17.8%

Automating Hierarchical Condition Category (HCC) assignment directly from unstructured electronic health record (EHR) notes remains an important but understudied problem in clinical informatics. We present HCC-Coder, an end-to-end NLP system that maps narrative documentation to 115 Centers for Medicare & Medicaid Services (CMS) HCC codes in a multi-label setting. On the test dataset, HCC-Coder achieves a macro-F1 of 0.779 and a micro-F1 of 0.756, with a macro-sensitivity of 0.819 and macro-specificity of 0.998. By contrast, Generative Pre-trained Transformer (GPT)-4o achieves its highest scores, a macro-F1 of 0.735 and a micro-F1 of 0.708, under five-shot prompting. The fine-tuned model demonstrates consistent absolute improvements of 4-5 percentage points in F1-scores over GPT-4o. To address severe label imbalance, we incorporate inverse-frequency weighting and per-label threshold calibration. These findings suggest that domain-adapted transformers provide more balanced and reliable performance than prompt-based large language models for hierarchical clinical coding and risk adjustment.
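The two imbalance techniques named in the abstract can be sketched generically as follows. This is an illustrative assumption of how inverse-frequency weighting and per-label threshold calibration are commonly done, not the HCC-Coder implementation; the weight formula, the threshold grid, and all function names are hypothetical.

```python
def inverse_frequency_weights(label_matrix):
    """One weight per label: n_samples / n_positives (rarer labels weigh more).

    label_matrix is a list of rows, each a list of 0/1 indicators per label.
    Labels with no positives get n_samples / 1 to avoid division by zero.
    """
    n_samples = len(label_matrix)
    n_labels = len(label_matrix[0])
    positives = [sum(row[j] for row in label_matrix) for j in range(n_labels)]
    return [n_samples / max(count, 1) for count in positives]


def f1_score(y_true, y_pred):
    """Binary F1 computed as 2TP / (2TP + FP + FN); 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(y_true, scores, grid=None):
    """Pick the lowest grid threshold that maximizes F1 for a single label."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best_threshold, best_f1 = 0.5, -1.0
    for threshold in grid:
        preds = [1 if s >= threshold else 0 for s in scores]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_threshold, best_f1 = threshold, score
    return best_threshold, best_f1


# Hypothetical validation data for one rare label with low-scoring positives.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.42, 0.37, 0.31, 0.10, 0.05, 0.20]
print(calibrate_threshold(y_true, scores))
print(inverse_frequency_weights([[1, 0], [1, 0], [1, 1], [0, 0]]))
```

In this toy example the default 0.5 cutoff would miss both positives, while a label-specific threshold recovers them. Thresholds should be calibrated on validation data, separately for each label, and then applied unchanged to the test set.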

13
Relationship Extraction for Adverse Drug Events in Clinical Notes Using Large Language Models

Plasek, J. M.; Li, Y.; Amato, M. G.; Foer, D.; Seger, D. L.; Alzaidi, S.; Zhou, H.; Jackson, G. P.; Bates, D. W.; Zhou, L.

2026-06-01 health informatics 10.64898/2026.05.28.26354362 medRxiv
Top 0.1%
14.8%

Background: Adverse drug events (ADEs) are a critical indicator of patient safety but are often documented only in free-text clinical notes. The potential of recent advances in natural language processing (NLP), particularly generative large language models (LLMs), to identify ADEs remains understudied. This study aimed to compare the performance of multiple LLMs in identifying ADE-Drug relationships in inpatient and ambulatory clinical notes. Methods: We used clinical notes from the 2018 National NLP Clinical Challenge (n2c2) ADE dataset (inpatient; n=505) and from outpatient encounters (n=2,555) between October 1, 2018, and December 31, 2019, at a large academic medical center based in New England. Notes were pre-processed into snippets for model input. The evaluated models included GPT-4o, GPT-4o-mini, LLAMA 3.3-70B, and their instruction fine-tuned variants (including low-rank adapters for LLAMA). Performance was assessed using both strict and relaxed evaluations (precision, recall, and F1) for all models, followed by manual evaluation (exact semantic match, partial match, missing ADE, drug mention only, not a drug, or wrong) of the two best-performing models. Results: GPT-4o and GPT-4o-mini were the top-performing models among those evaluated. GPT-4o consistently outperformed GPT-4o-mini in ADE extraction across both datasets, with higher F1-scores (0.524 vs. 0.381) and a more balanced precision-recall profile. Both models captured ADEs effectively in explicit and complex clinical contexts, although limitations included misclassification of pre-existing allergies and occasional conflation of therapeutic indications with adverse effects. GPT-4o achieved higher exact match coverage and fewer errors across clinical notes, indicating more reliable performance in both inpatient and ambulatory settings. 
Conclusion: This work establishes a foundation for integrating LLM methods into real-world drug safety surveillance, with direct implications for improving patient safety.

14
Combining Token Classification With Large Language Model Revision for Age-Friendly 4M Entity Recognition From Nursing Home Text Messages: Development and Evaluation Study

Amewudah, P.; Popescu, M.; Farmer, M. S.; Powell, K. R.

2026-04-01 health informatics 10.64898/2026.03.31.26349861 medRxiv
Top 0.1%
14.2%

Background: Secure text messages (TMs) exchanged among interdisciplinary care teams in nursing homes (NHs) contain clinical information that aligns with the Age-Friendly Health Systems 4Ms: What Matters, Medication, Mentation, and Mobility. Yet this information is not captured in any structured form, making it unavailable for systematic monitoring or quality reporting. Automatically extracting 4M information accurately and efficiently from these messages could enable several downstream applications within long-term care settings. This task, however, is challenging because of the fragmented syntax, brevity, abbreviations, and informality of TMs. Objective: This study aimed to develop and evaluate a multi-stage 4M Entity Recognition (4M-ER) pipeline that combines a fine-tuned token classifier with large language model (LLM) revision, using only locally deployed open-source models, to improve 4M information extraction from clinical TMs. Methods: We used an expert-annotated dataset of 1,169 TMs collected from interdisciplinary teams across 16 Midwest NHs. The pipeline first identifies candidate text spans using a fine-tuned Bio-ClinicalBERT token classifier. A semantic similarity retriever then selects in-context exemplars to guide an LLM revision in which the LLM (Gemma, Phi, Qwen, or Mistral) performs boundary correction, label evaluation, and selective acceptance or rejection of candidate spans. Baselines for comparison included single-stage zero-shot LLMs, single-stage fine-tuned Bio-ClinicalBERT, and a fine-tuned LLM (Gemma) from a prior study. Ablation studies assessed the contribution of each pipeline stage and the effect of message filtering. Robustness was evaluated across 5 repeated runs. Results: The 4M-ER pipeline outperformed the previously fine-tuned Gemma LLM across all 4M domains, achieving F1 (entity type) improvements of +2 to +11 percentage points without any additional fine-tuning and at roughly half the GPU memory (12 vs 24 GB). 
It also improved upon single-stage fine-tuned Bio-ClinicalBERT in Mobility, Mentation, and What Matters (+0.02 to +0.05 F1). Error analysis showed that LLM revision reduced false positives by 25% to 35% by correcting misclassifications caused by conversational ambiguity, while the fine-tuned Bio-ClinicalBERT's high recall captured subtle entities that the fine-tuned Gemma missed. Silver data augmentation further improved the hardest domains, raising What Matters F1 from 0.59 to 0.67 and Mobility from 0.64 to 0.67. Ablation studies confirmed that restricting LLMs to revision only yielded optimal accuracy and efficiency. Conclusions: The 4M-ER pipeline enables accurate and scalable extraction of 4M entities from clinical TMs by combining fine-tuned Bio-ClinicalBERT with LLM revision using only locally deployed open-source models. The structured 4M data produced by the pipeline can support 4M taxonomy and ontology construction, as demonstrated in the prior work, and provides a foundation for downstream applications including real-time clinical surveillance, compliance with emerging age-friendly quality measures, and predictive modeling in long-term care settings.

15
Reproducibility and Robustness of Large Language Models for Mobility Functional Status Extraction

Liu, X.; Garg, M.; Jeon, E.; Jia, H.; Sauver, J. S.; Pagali, S. R.; Sohn, S.

2026-04-05 health informatics 10.64898/2026.04.03.26350117 medRxiv
Top 0.1%
14.1%

Clinical narrative text contains crucial patient information, yet reliable extraction remains challenging due to linguistic variability, documentation habits, and differences across care settings. Large language models (LLMs) have shown strong accuracy on clinical information extraction (IE), but their reproducibility (stability under repeated runs) and robustness (stability under small, natural prompt variations) are less consistently quantified, despite being central to clinical deployment. In this study, we evaluate three open-weight LLMs representing distinct modeling choices: a dense general-purpose model (Llama 3.3), a mixture-of-experts (MoE) general-purpose model (Llama 4), and a domain-tuned medical model (MedGemma). We focus on binary clinical IE aligned with four mobility classes from the International Classification of Functioning, Disability and Health (ICF) framework. Using a controlled experimental design, we quantify (1) intra-prompt reproducibility across repeated sampling and (2) inter-prompt robustness across paraphrased prompts. We jointly report predictive performance (F1-score) and stability (Fleiss' kappa [κ]), and we test factor effects using three-way ANOVA with post-hoc comparisons. Results show that increasing temperature generally degrades agreement, but the magnitude depends on model and task; furthermore, prompt paraphrasing can substantially reduce stability, with particularly large drops for the MoE model. Finally, we evaluate a practical mitigation, self-consistency via majority voting, which improves κ substantially and often improves or preserves F1-score, at the cost of additional inference. Together, these findings provide a reproducible framework and concrete recommendations for evaluating and improving LLM reliability in clinical IE.
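To make the stability measure and the mitigation in this abstract concrete, the following is a minimal sketch of Fleiss' kappa over repeated runs and of majority voting across those runs. It is an illustration under stated assumptions, not the authors' code: the function names, the toy labels, and the choice to treat each repeated run as one "rater" are all hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of runs.

    ratings: list of lists; ratings[i] holds the labels that repeated runs
             assigned to item i (assumption: each run plays the role of a rater).
    categories: all possible labels, e.g. [0, 1] for a binary task.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of runs that assigned category j to item i
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall proportion of each category
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        # Every run gave the same label to every item; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


def majority_vote(labels):
    """Self-consistency: return the most frequent label among repeated runs."""
    return Counter(labels).most_common(1)[0][0]


# Toy data: 4 notes, 3 repeated runs each, binary labels (hypothetical values)
runs_per_note = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
kappa = fleiss_kappa(runs_per_note, categories=[0, 1])
voted = [majority_vote(labels) for labels in runs_per_note]
```

With an odd number of runs and binary labels, the vote cannot tie; other settings would need an explicit tie-breaking rule.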

16
Towards reproducible multimorbidity clustering in electronic health records: a transparent pipeline for aligning research aims and methodology

Romero Moreno, G.; Restocchi, V.; De Ferrari, L.; Palmer, J.; Fleuriot, J. D.; Guthrie, B.; Lone, N. I.

2026-05-26 health informatics 10.64898/2026.05.25.26353178 medRxiv
Top 0.1%
13.8%

The availability of electronic health records has facilitated data-driven approaches to the understanding of multimorbidity, with clustering becoming a common tool for uncovering relevant groups of associated conditions. Previous studies, however, have found challenges in their reproducibility, with wide disparity in the reported clusters. At the core of this issue lies vagueness in the definition of a cluster, leading to a lack of standards in methods and evaluation, while implementation details are often not completely reported or explicit in their assumptions. We present a methodological pipeline that can be adapted to different cluster definitions (e.g. multiple cluster membership or clusters where all nodes are mutually associated) and a set of scores that can be composed into an evaluation metric that explicitly incorporates assumptions that align with the research aims. We apply our pipeline to a healthcare dataset of over 7 million patients in England and show how clusters may drastically differ when varying the parameter choices, exposing the risks of reporting a single clustering realisation. Our methodological pipeline, evaluation framework, and tools for analysis and network visualisation serve as a reference to transparently explore and align methodological decisions to the aims of multimorbidity clustering, helping to overcome the reproducibility challenges of the field.

17
Longitudinal information extraction from clinical notes in rare diseases: an efficient approach with small language models

Wang, X.; Faviez, C.; Vincent, M.; Andrew, J. J.; Le Priol, E.; Saunier, S.; Knebelmann, B.; Zhang, R.; Garcelon, N.; Burgun, A.; Chen, X.

2026-03-31 health informatics 10.64898/2026.03.30.26349388 medRxiv
Top 0.1%
12.9%

Objectives: Rare diseases often require longitudinal monitoring to characterise progression, yet much clinical information remains locked in unstructured electronic health records (EHRs). Efficient recovery of such data is critical for accurate prognostic modelling and clinical trial preparation. We aimed to develop and evaluate a small language model (SLM)-based pipeline for extracting longitudinal information from French clinical notes of patients with rare kidney diseases. Methods: As a use case, we focused on serum creatinine, a key biomarker of kidney function. We analyzed 81 clinical notes comprising 200 measurements (triplet of date, value and unit). Four open-source SLMs (Mistral-7B, Llama-3.2-3B, Qwen3-4B, Qwen3-8B) were systematically tested with different prompting strategies in French and English. Outputs were post-processed to standardize formats and resolve inconsistencies, and performance was assessed across model size, prompting, language, and robustness to text duplication. Results: All SLMs extracted structured triplets, with F1-scores ranging from 0.519 to 0.928 (Qwen3-8B), outperforming the rule-based baseline. Larger models generally performed better, while prompting strategy and language had modest effects across models. SLMs also showed variable robustness to duplicated content common in real-world EHR notes. Discussion: Lightweight, locally deployable language models can accurately extract longitudinal biomarkers from unstructured clinical notes. Our findings highlight their practicality for rare diseases where data scarcity often limits task-specific model training. Conclusion: SLMs provide a privacy-preserving and resource-efficient solution for recovering longitudinal biomarker trajectories from unstructured notes, offering potential to advance real-world research and patient care in rare kidney diseases.

18
Augmenting Structured Diagnoses through Effective Use of Pre-trained Large Language Models on Clinical Notes

Razzaghi, H.; Nguyen, N.; Pargi, M.; Wieand, K.; Bunnell, T.; Bailey, C.

2026-06-02 health informatics 10.64898/2026.05.30.26354533 medRxiv
Top 0.1%
12.6%

Objective: Clinical narrative provides a unique window into provider reasoning and attribution, but use has been limited by resource requirements and extensive fine-tuning, and LLMs in particular have traditionally not performed well at medical coding. We optimize and evaluate a reproducible method for automated diagnosis assignment using LLMs in clinical notes and compare with EHR structured diagnoses. Methods: We used GPT-OSS for prompt engineering and task segmentation to create a model that extracts ICD-10-CM diagnoses, with estimates of severity, currency, and importance, from progress notes. We assessed performance across multiple cohorts of patients aged 0-21 years. For each, 100 outpatient provider notes were selected across levels of severity, along with coded diagnoses from that visit (EHR); a subset of 130 notes were subjected to clinical expert review. Results: Comparison showed 18.7% exact code and 33.3% ICD-10-CM category match between EHR and LLM, but semantic similarity of 0.93 at the category level. Compared to expert review, LLM precision was 0.84 and recall 0.49 for exact matches, and 0.92 and 0.62, respectively, for category-level matching. In contrast, EHR coded diagnoses showed slightly higher precision (0.94 for both cases) and substantially lower recall (0.27 and 0.43) versus expert review. Codes not identified by the LLM were more often rated by the reviewer as lower importance or certainty. Conclusion: We demonstrate a reusable approach to optimizing a pretrained LLM for use in diagnosis extraction from clinical notes, facilitating large-scale diagnosis screening by LLMs without the need for expensive study-specific model refinement.
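The difference between exact-code and category-level matching reported above can be illustrated with a short sketch. It assumes set-based matching per note and takes the ICD-10-CM category to be the three characters before the decimal point; the abstract does not describe how the authors computed their scores, and the function names and example codes here are hypothetical.

```python
def icd10cm_category(code):
    """Return the three-character ICD-10-CM category, e.g. 'E11.65' -> 'E11'."""
    return code.split(".")[0][:3].upper()


def precision_recall(predicted, reference, key=lambda code: code):
    """Set-based precision and recall after mapping each code through `key`.

    key: identity for exact-code matching, icd10cm_category for
         category-level matching (an assumption of this sketch).
    """
    pred = {key(c) for c in predicted}
    ref = {key(c) for c in reference}
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical codes for one note: model output vs. expert review
llm_codes = ["E11.9", "J45.909", "F90.0"]
expert_codes = ["E11.65", "J45.909", "R05.9", "F41.1"]

exact_p, exact_r = precision_recall(llm_codes, expert_codes)
cat_p, cat_r = precision_recall(llm_codes, expert_codes, key=icd10cm_category)
```

In this toy note, E11.9 and E11.65 disagree as exact codes but share the category E11, which is why category-level scores are never lower than exact-match scores under this definition.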

19
BRIDGE: a barrier-informed Bayesian Risk prediction model for risk IDentification, trajectory Grouping, and profiling of non-adherencE to cardioprotective medicines in primary care

Koh, H. J. W.; Trin, C.; Ademi, Z.; Zomer, E.; Berkovic, D.; Cataldo Miranda, P.; Gibson, B.; Bell, J. S.; Ilomaki, J.; Liew, D.; Reid, C.; Lybrand, S.; Gasevic, D.; Earnest, A.; Gasevic, D.; Talic, S.

2026-04-22 pharmacology and therapeutics 10.64898/2026.04.21.26351387 medRxiv
Top 0.1%
12.5%

Background: Non-adherence to lipid-lowering therapy (LLT) affects up to half of patients and contributes substantially to preventable cardiovascular morbidity and mortality. Existing measures, such as the proportion of days covered, provide cross-sectional summaries but fail to capture the dynamic patterns of adherence over time. Although group-based trajectory modelling identifies distinct longitudinal adherence patterns, no approach currently predicts trajectory membership prospectively while incorporating patient-reported barriers. We developed BRIDGE, a barrier-informed Bayesian model to predict adherence trajectories and identify their underlying drivers. Methods: BRIDGE incorporates patient-reported barriers as structured prior information within a Bayesian framework for adherence-trajectory prediction. The model was designed not only to estimate which patients are likely to follow different adherence trajectories, but also to generate clinically interpretable probability estimates that help explain why those trajectories may arise and what modifiable factors may be most relevant for intervention. Results: BRIDGE achieved a macro AUROC of 0.809 (95% CI 0.806 to 0.813), comparable to random forest (0.815 (95% CI 0.812 to 0.819)) and XGBoost (0.821 (95% CI 0.818 to 0.824)), two widely used machine-learning benchmarks for structured clinical prediction. Calibration was superior to random forest (Brier score 0.530 vs 0.545), and performance was stable across six independent training runs (AUROC SD = 0.003). Incorporating barrier-informed priors improved accuracy by 3.5% and calibration by 5.5% compared to flat priors, showing that incorporation of patient-reported barriers added value beyond electronic medical record data alone. Four clinically distinct adherence trajectories were identified: gradual decline associated with treatment deprioritisation amid polypharmacy (10.4%), early discontinuation linked to asymptomatic risk dismissal (40.5%), rapid decline associated with intolerance (28.8%), and persistent adherence (20.2%). Counterfactual analysis identified trajectory-specific intervention levers. Conclusions: BRIDGE provides accurate and well-calibrated prediction of adherence trajectories while offering clinically actionable insights into their underlying drivers. By integrating patient-reported barriers with routine clinical data, the model supports targeted, mechanism-informed interventions at the point of prescribing to improve adherence to cardioprotective therapies. Funding: MRFF CVD Mission Grant 2017451. Evidence before this study: We searched PubMed and Scopus from database inception to December 2025 using the terms "medication adherence", "trajectory", "prediction model", "Bayesian", "lipid-lowering therapy", and "barriers", with no language restrictions. Group-based trajectory modelling has consistently identified three to five adherence patterns across cardiovascular cohorts; however, these applications have been descriptive rather than predictive. Machine-learning models for adherence prediction achieve moderate discrimination but treat adherence as a binary or continuous outcome, thereby overlooking the clinically meaningful heterogeneity captured by trajectory approaches. One prior study applied a Bayesian dynamic linear model to examine adherence-outcome associations, but it did not predict adherence trajectories or incorporate patient-reported barriers. To our knowledge, no published model integrates patient-reported barriers into trajectory prediction. Added value of this study: BRIDGE is, to our knowledge, the first model to incorporate patient-reported adherence barriers as hierarchical domain-informed priors within a Bayesian framework for trajectory prediction. Using 108 predictors derived from routine electronic medical records, the model achieves discrimination comparable to state-of-the-art machine-learning approaches while additionally providing uncertainty quantification, barrier-level interpretability, and counterfactual insights to inform intervention strategies. The identified trajectories differed not only in adherence level but also in switching behaviour, drug-class evolution, and medication burden, suggesting distinct underlying mechanisms of non-adherence that may require tailored clinical responses. Implications of all the available evidence: Each adherence trajectory implies a distinct intervention target: asymptomatic risk communication for early discontinuers (40.5% of patients), proactive tolerability management for rapid decliners, medication simplification for patients with gradual decline associated with polypharmacy, and maintenance support for persistent adherers. By integrating routinely collected clinical data with patient-reported barriers, BRIDGE can be deployed within existing primary care EMR infrastructure to generate actionable, trajectory- and patient-specific recommendations at the point of prescribing, helping to bridge the gap between adherence measurement and targeted adherence management.
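The calibration comparison in the results above relies on the Brier score. As a rough sketch, the multiclass form sums squared differences between predicted probabilities and the one-hot outcome over all classes, then averages over patients; lower is better. The abstract does not state which Brier variant was used, so this definition, the function name, and the toy probabilities are assumptions.

```python
def multiclass_brier(probabilities, labels):
    """Mean over samples of the summed squared error across classes.

    probabilities: list of per-sample probability vectors (one entry per class)
    labels: list of true class indices, one per sample
    """
    total = 0.0
    for probs, true_class in zip(probabilities, labels):
        total += sum(
            (p - (1.0 if k == true_class else 0.0)) ** 2
            for k, p in enumerate(probs)
        )
    return total / len(probabilities)


# Hypothetical predictions over four trajectory classes
confident_and_right = multiclass_brier([[1.0, 0.0, 0.0, 0.0]], [0])
uninformative = multiclass_brier([[0.25, 0.25, 0.25, 0.25]], [2])
```

Under this definition a uniform guess over four classes scores 0.75 and a perfect prediction scores 0, which gives a scale for reading four-class values.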

20
Synonym Augmentation for Rare Disease Identification in Unstructured Data

Valinejad, J.; Moon, S.; Xu, Y.; Zhu, Q.

2026-05-13 health informatics 10.64898/2026.05.11.26352910 medRxiv
Top 0.1%
12.5%

The significant challenges associated with rare diseases in the medical and research domains include the scarcity of information, which is often confined to unstructured formats. Although existing approaches provide valuable insights, there is a need to develop effective methods to identify information pertinent to rare diseases for advancing rare disease research. We identified mentions of rare diseases in relevant texts and assessed their relevance using derived scores, namely a confidence score and semantic similarity from a fine-tuned BioMedBERT encoder. This encoder was fine-tuned using rare-disease-related text from Online Mendelian Inheritance in Man (OMIM), Orphanet, a manually validated dataset, and STS benchmark datasets. The process of identifying meaningful rare disease mentions was presented through two case studies that retrieved relevant NIH-funded projects, utilizing a generated knowledge graph in Neo4j to host data on 2,067 GARD diseases with over 320,000 NIH-funded projects. Through various case studies with NIH-funded projects related to rare diseases, we demonstrated the effectiveness of our approach in systematically providing rare-disease-related data to enhance our understanding of rare diseases for future investigations.